Abstract

The normal operation of societies relies on the proper and secure
functioning of several critical infrastructures, particularly the modern
power grid, also known as the smart grid. The smart grid is interwoven with
the information and communication technology infrastructure, and it is thus
exposed to cyber security threats. Intrusion tolerance is a promising
security approach against malicious attacks and helps enhance the resilience
and security of the key components of the smart grid, mainly SCADA and
control centers. Hence, an intrusion tolerant system (ITS) architecture for
smart grid control centers is proposed in this paper. The proposed
architecture consists of several modules, namely replication & diversity,
compromised/faulty replica detector, reconfiguration, auditing and proxy.
Distinctive features of the proposed ITS include diversity and a combined,
fine-grained rejuvenation approach. The security of the proposed
architecture is evaluated with availability and mean time to security
failure as performance measures. The analysis is conducted using a Discrete
Time Semi Markov Model, and the results show improvements compared to two
established intrusion tolerant architectures. The viability of SLA as
another performance metric is also investigated.

Keywords: Smart grid security, Control center, Availability, SCADA, 
Intrusion tolerance. 
1. Introduction

In the recent decade, the growing dependence of critical infrastructures on
Information and Communication Technology (ICT) and open standards has
raised serious concerns about security. The future power grid, also known
as the smart grid, exemplifies one of these critical infrastructures. In
addition to its environmental benefits (using renewable energy resources to
reduce the carbon footprint) and its economic merits for both utilities and
consumers (through dynamic pricing and active end-user participation), an
outstanding feature of the smart grid is the integration of fast, dependable
and secure data communication networks to control and monitor the intricate
power systems in an effective and intelligent way [1]. The cyber-physical
dependencies (i.e., the combination of the legacy power grid and the
communication networks and their interdependencies) [2], large-scale
operation, heterogeneity and complexity [3, 4, 5], along with sophisticated
and novel attacks, pose grave new threats to mission-critical applications,
in particular the smart grid. Moreover, the security objectives of the smart
grid differ from ICT security goals in their order of significance:
availability and continuity of service are the main security priority [6].
Even the Quality of Service (QoS) requirements differ from those of ICT.
Message delay is of great importance in the smart grid, whereas data
throughput receives special attention in the Internet [7]. On top of all
these issues, the widespread socioeconomic impacts of a malfunction or
failure of the smart grid resulting from accidental or malicious events
mandate more automatic and robust security solutions [8, 9].

The prime goal of any cyber-physical system such as the smart grid is to
offer smooth control over some physical process [10], which makes
availability and integrity the overriding security attributes. Thus, attacks
on the control systems of cyber-physical systems can adversely affect their
security and reliability. Some recent high-profile attacks have mainly
targeted critical infrastructure control systems and crucial organizations.
The Stuxnet worm, which emerged in July 2010, aimed to control critical
infrastructures. It exploited a vulnerability in MS Windows and attempted to
modify the code running in Programmable Logic Controllers (PLCs) [11, 12].
Duqu, a malware closely similar to Stuxnet, came into the spotlight in 2011.
It acted as a Remote Access Trojan (RAT) for the purpose of information
gathering and was not specifically targeted at control systems.
Nevertheless, the compromised organizations included manufacturers of
industrial control systems [13, 14]. The most recent and novel threat (made
public in May 2012), called FLAME [15], was a sophisticated,
state-of-the-art malware with a modular structure that was able to perform
information theft and took advantage of many attack and propagation
methods. This espionage malware targeted some governmental organizations in
Iran [16].

All the aforementioned security concerns and incidents are contributing
factors that should change our mindset about the level of security
achievable through present security mechanisms, especially for critical
infrastructures. Conventional security approaches (i.e., prevention and
detection) have proved insufficient to tackle the security problems [9],
underscoring the need for a more resilient and robust security approach. To
satisfy these security requirements, a promising mechanism called intrusion
tolerance has come into existence, and it has received considerable
attention in recent years [8, 9, 17, 18, 19, 20, 21, 22, 23, 24, 25].
However, as stated in [26], the first usage of the term intrusion tolerance
dates back to 1985, in [27] by Fraga and Powell. Intrusion tolerance starts
from the premise that it is always possible for a system to be vulnerable to
security compromise and for some attacks on the system to succeed [8]. In
spite of these assumptions, intrusion tolerance mechanisms ensure that the
system continues its normal activities (or acts in a degraded mode providing
only essential services) even when it is under attack or partially
compromised. Thus, rather than being prevented from happening, intrusions
are permitted but tolerated by adopting and triggering appropriate
mechanisms such as redundancy, diversity, rejuvenation, and so on. These
techniques mask, remove or recover from intrusions and preclude them from
turning into security failures [26]. Consequently, the system remains
highly survivable in the face of malicious attacks and intrusions.
Intrusion tolerance can therefore be considered a last-resort security
solution when other security measures fail to accomplish their intended
purpose.

To the best of our knowledge, the significance of intrusion tolerance as a
prospective security mechanism for the smart grid has only been pointed out
in the research carried out in [10] and [28]. This paper highlights the
importance of the intrusion tolerance approach, which raises the
possibility of enhancing the security of critical components in the smart
grid, particularly control centers and Supervisory Control and Data
Acquisition (SCADA) systems. Hence, an Intrusion Tolerant System (ITS)
architecture is proposed to strengthen the level of security in such
systems. In addition, the security attributes of the proposed architecture
are evaluated using a semi Markov model.

The paper is organized as follows. Section 2 highlights the security of the 
smart grid communication infrastructure especially in control centers. This 
section places an emphasis on the need for a robust defense-in-depth security 
approach to be adopted in smart grid control centers taking into account 
the limitations of the fundamental security mechanisms. Therefore, intru- 
sion tolerance is introduced as a promising security solution for smart grid. 
Section 3 provides a detailed analysis of intrusion tolerance. The
differences between intrusion tolerance and fault tolerance, as well as
classical security mechanisms (i.e., prevention and detection), are
discussed. The most commonly used intrusion tolerance techniques are
presented, and a comparison is made between some of the existing ITS
architectures. In Section 4, a detailed discussion of the proposed intrusion
tolerant architecture for smart grid control centers is presented. The
performance of the proposed architecture is evaluated analytically and
compared with established ITSs in Section 5. Finally, Section 6 draws the
conclusion.

2. Smart Grid Cyber Security as a Cyber Physical System 

As mentioned in the previous section, the smart grid is the modernized
power grid that is inextricably interwoven with information and
communication technology. Advanced features of such a complex system of
systems include, but are not limited to, two-way communications between
customers and utilities, Demand Response (DR), Distributed Energy Resources
(DER), sophisticated sensing technologies and real-time control and
monitoring [29], along with self-healing capabilities. Enabling these
functionalities requires an effective, reliable, secure and resilient
communication infrastructure [30]. Figure 1 shows a general view of the
smart grid communication infrastructure. Home Area Networks (HANs) and
Business Area Networks (BANs) comprise the level of the communication
infrastructure that is closest to the electricity consumers. This part of
the smart grid communication infrastructure enables DR and the active
participation of end users through the use of smart meters. Geographically
close HANs or BANs make up another level of the smart grid infrastructure
hierarchy called the Neighborhood Area Network (NAN). It contributes to the
exchange and sharing of information between electricity distribution
facilities and consumers' premises. Finally, the Wide Area Network (WAN)
furnishes the smart grid infrastructure with the backbone to transmit
control commands and monitoring signals from control centers, SCADA systems
in particular, to electric devices located in substations, as well as the
real-time measurements from electric devices to the control centers. It
encompasses several NANs, each formed under one substation [1].

Figure 1. Smart grid communication infrastructure.

Smart grid is regarded as a cyber-physical system, i.e., it is a system in 
which cyber security attacks can give rise to disruptions that go beyond the 
cyber domain and impact the physical world[3]. In other words, security at- 
tacks may occur in both the physical space (i.e., the traditional power grid) 
and cyber space (i.e., the communication networks). Security in both physi- 
cal and cyber domains is one of the principal objectives of the smart grid[30]. 
The U.S. Department of Energy (DoE) has recognized attack resistance as one
of the salient features needed for the operation of the smart grid [2]. The
use of open standard software and protocols has opened avenues for attackers
to pose dire threats to different sections of the smart grid communication
infrastructure, particularly SCADA systems. Furthermore, the escalating
number of electrical outages and brown-outs worldwide during the last decade
has shown the power grid to be a potential target for malicious attacks. The
cascading power outages that have arisen in Europe [31], as well as those in
the United States and other countries [4, 32], are the consequences of such
intrusive measures.

Control centers are considered the brain of the smart grid [33]. They are in
charge of data analysis and decision making [2]. Based on the assembled
data, they make appropriate adjustments to the power supply to satisfy
demand, and they spot and respond to defects or failures by sending control
commands to field devices [30]. SCADA and the Energy Management System
(EMS), as the key components in control centers, play a pivotal role in the
proper operation of the smart grid. Any malfunction or failure of these
systems may result in widespread and devastating effects (e.g., power
outages, cascading blackouts) on industry, the economy and people's daily
lives. Another possible consequence of control center disruption is the
loss of consumer and public trust [33]. Therefore, the correct functioning
of these systems in exigent security circumstances is of paramount
importance. Several attacks on SCADA systems have been investigated in [34]
according to their perpetrators as well as the industry sectors affected by
such attacks. The energy sector has been noticeably more impacted by SCADA
incidents and attacks than the other industry sectors. The risks associated
with a SCADA system that a government or company may be faced with include
financial loss or even injury or loss of life. Defects such as unpatched
software, software bugs, buffer overflows and poor administration
contribute to the launching of attacks against SCADA systems. Two of the
dire threats to SCADA/EMS are Denial of Service (DoS) and unauthorized
access/integrity breach [34]. These threats make the control signals from
the monitoring system unreliable, along with the measurement data gathered
in the smart grid that are used for pricing or state estimation purposes.
The possible ramifications would be massive brownouts or blackouts. For
instance, the SQLSlammer worm represents a serious DoS threat to the control
systems in the smart grid and in any critical infrastructure. Owing to the
time-criticality of communication and control in the smart grid, a delay of
a few seconds (following from an availability attack) may lead to
irreparable harm to the national economy and security [30]. Figure 2 shows a
control center that supervises multiple substations in the smart grid [35].
The key components of the control center (i.e., SCADA/EMS) and of the
substations, including the PLC, Remote Terminal Unit (RTU) and Intelligent
Electronic Device (IED), are shown as well.

Figure 2. Control center and substations in smart grid.

The significance of a defense-in-depth security approach for the smart grid
has been highlighted in several published papers [12, 28, 32, 36], since it
requires adversaries to spend a great deal of time and effort to evade
different layers of defense. This layered approach would involve the
adoption of best cyber security practices such as firewalls, Role-Based
Access Control (RBAC), cryptography, Intrusion Detection and Prevention
Systems (IDPS), and so on. However, these security mechanisms are subject to
certain restrictions concerning their scope of operation and effectiveness.
In addition to the limitations associated with the classical security
approaches, there are other factors that restrict the use of some of these
mechanisms in SCADA systems. Since SCADA systems hinge on the timely
presentation of data, firewalls and anti-virus software may reduce the speed
of data flow and subsequently decrease the accuracy of SCADA systems. In
such circumstances, SCADA operators tend to deactivate or bypass these
security mechanisms. Furthermore, patching SCADA systems is a tricky task:
patches may introduce unknown effects that violate the correct operation
and availability of the system, and comprehensive test environments are
lacking. Network Intrusion Detection Systems (NIDSs) may compensate for
such a limitation [34].

3. Cohesive Intrusion Tolerance for Smart Grid 

As stated in the previous section, intrusion tolerance shows enormous 
potential to be adopted and deployed in smart grid control centers. Intrusion 
tolerance and its paradigms enable secure and normal operation of the smart 
grid control centers, even when the system is being attacked or partially 
compromised. The primary goal is to tolerate malicious events and sustained
attacks, and to mask, remove or recover from intrusions. Thus, intrusion
tolerance measures avert security failures and help maintain the
availability of the system. Moreover, intrusion tolerance places emphasis on
the impact of an attack rather than its cause.
3.1. A General Misconception about Intrusion Tolerance

A common preconception is that fault tolerance and intrusion tolerance are
the same concept. In fact, fault tolerance can be considered a predecessor
of intrusion tolerance. Although these mechanisms have similarities,
especially in the techniques they use, some differences exist. The
distinction between the two approaches lies in the nature of the possible
faults in a system. A fault is an imperfection or defect in the system that
can give rise to an error, which may result in a subsequent failure. Based
on the definition of fault in [22, 37], faults can be categorized as
non-malicious or malicious. Non-malicious faults include accidental design
flaws (e.g., software bugs), deliberate design defects following from
constraints such as cost, environmental or natural perturbations, or even a
mistaken action carried out by a distracted operator. These types of faults
are infrequent and occur at random. Fault tolerance deals with this class of
faults. In contrast, a malicious fault, also called an intrusion, is an
intentional operational fault that stems from a successful attack on a
system vulnerability. Intrusion tolerance should handle malicious faults,
i.e., intrusions, which are prevalent in information and communication
systems and also in critical infrastructures such as the smart grid.

3.2. Intrusion Tolerance versus Classical Security Mechanisms 

Intrusion tolerance is commonly referred to as the third generation of 
security technologies [38] which provides complementary features to conven- 
tional security mechanisms, i.e., prevention and detection. Some of the driv- 
ing forces behind the increasing tendency for employing intrusion tolerance 
techniques are as follows: 

• The growing number of novel and zero-day attacks and thus the
infeasibility of preventing or detecting all intrusions effectively [8]

• The sheer complexity of the systems that makes it impossible to pin- 
point all of their vulnerabilities prior to coming into operation [25] 

• Preventive security measures such as firewall, access control, authenti- 
cation and authorization mechanisms are mainly proactive [19] and do 
not guarantee perfect protection [8] 
• Intrusion detection techniques including misuse detection and anomaly
detection are based on property checks (i.e., comparing observed activ- 
ity with known patterns of attacks or normal behavior of the system) 
[22] and may result in high false positive or false negative rates 

• Detection methods are predominantly reactive with limited automated 
defense capabilities and require human intervention to conduct a post- 
mortem and deal with the identified security threats[39]. This may lead 
to a slow reaction in the face of attacks that require to be dealt with 
immediately (especially in critical systems such as smart grid control 
centers) 

Given the aforementioned issues and the fact that downtime, failure or
malfunction of the smart grid control centers is not acceptable and must be
kept to a minimum, there is an urgent need for more automatic and resilient
security approaches. Therefore, intrusion tolerance through appropriate
means (e.g., redundancy, dynamic and adaptive rejuvenation and
reconfiguration) is vital to fulfill the security and survivability
requirements of the smart grid.

3.3. Paradigms of Intrusion Tolerance

Figure 3 illustrates several common paradigms of intrusion tolerance. These
techniques assist in achieving the goal of intrusion tolerance, which is
providing correct service despite the presence of active attacks and
intrusions. Although these methods incur substantial costs, such as
performance, administration and maintenance costs, the expense is justified
in mission-critical systems such as smart grid control centers, in which the
adverse effects of intrusions may lead to higher expenses or even
irrecoverable losses. The most widely used paradigms of intrusion tolerance
are described as follows:

3.3.1. Redundancy 

Redundancy is defined as allotting additional resources to a system beyond
what it needs in normal operation. There exist different types of
redundancy, including space redundancy, time redundancy and information
redundancy, among which space redundancy (i.e., physical resource
redundancy or replication) has received considerable attention. In fact,
replication is an indispensable part of intrusion tolerant systems.
Replicated systems usually operate based on Byzantine Fault Tolerant (BFT)
protocols, i.e., a system containing n replicas is able to work properly
even if f < n replicas undergo arbitrary faults (f is the number of
tolerable faulty replicas) [20]. However, redundancy suffers from the
underlying problem of fate sharing among replicas [37, 39]: if an adversary
locates and exploits a vulnerability in one replica, it is highly likely
that all replicas are susceptible to the same threat.

Figure 3. Paradigms of intrusion tolerance.

3.3.2. Diversity 

To alleviate the problem of fate sharing associated with redundancy, it is
common to employ diversity as a complementary technique. Diversity equips
the replicas with security failure independence. Diversity also has several
variants, such as space diversity, time diversity and implementation
diversity. Operating System diversity (OS diversity) has gained momentum in
intrusion tolerant systems. The reason is twofold. First, the availability,
lower complexity and cost-effectiveness of using various operating systems
make them an appropriate choice for applying diversity to ITSs. Second, many
intrusions exploit vulnerabilities of operating systems due to their pivotal
role in any system [40]. The operating system can be considered one of the
vulnerable parts of a system regardless of the robustness of the software
running on top of it [41].



3.3.3. Dynamic recovery and reconfiguration 

One of the contributing factors to intrusion tolerance is to dynamically 
reconfigure the replicas. This reconfiguration may involve measures such as 
rejuvenation (i.e., recovery), modifying the system's posture, rollover and 
load sharing among which rejuvenation is widely employed in ITSs. Rejuve- 
nation involves the restoration of a replica to a pristine state to eliminate the 
likely effects of intrusions or faults[39]. This may include the modification of 
the cryptographic keys or loading a clean copy of the operating system and 
applications [21]. 

Although BFT protocols are effective in delaying the failure of the overall
replicated system by a specific amount of time, they are highly dependent on
the value of f and the degree of diversity in the replicated system [20].
Increasing f incurs more cost on the system and also necessitates a higher
degree of diversity, which has a limited scope (e.g., in the case of OS
diversity, the number of existing operating systems is limited).
Rejuvenation is considered an acceptable solution to these problems, since
it decreases the required value of f and the time the attacker has at their
disposal to corrupt more than f replicas. Moreover, the constant
availability requirement, along with the unknown execution time of critical
infrastructures such as the smart grid, underlines the need for a recovery
mechanism to make sure that the allowed maximum number of compromised
components is not exceeded. However, for rejuvenation to be effective, the
rejuvenation rate should be kept higher than the intrusion rate [42].
Another point is that the allowed number of faulty or compromised replicas
(i.e., f) is an upper bound on the number of concurrent rejuvenations.
Availability is violated when the total number of compromised replicas and
rejuvenations in progress exceeds f. Therefore, the total number of
replicas should be more than n (as in BFT systems) to satisfy the
availability requirements [43].
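
The two conditions stated above (rejuvenation must outpace intrusion, and
compromised replicas plus rejuvenations in progress must not exceed f) can
be written down as simple checks. The following Python sketch is only an
illustration under our own naming; the class and function names are
hypothetical and do not belong to any of the cited systems.

```python
from dataclasses import dataclass


@dataclass
class ReplicaGroupState:
    """Snapshot of a replicated system (field names are illustrative)."""
    f: int             # tolerable number of faulty/compromised replicas
    compromised: int   # replicas currently compromised
    rejuvenating: int  # replicas currently being rejuvenated


def is_available(state: ReplicaGroupState) -> bool:
    """Availability holds while compromised + rejuvenating <= f."""
    return state.compromised + state.rejuvenating <= state.f


def rejuvenation_budget(state: ReplicaGroupState) -> int:
    """Further rejuvenations that may start without violating availability."""
    return max(0, state.f - state.compromised - state.rejuvenating)


def rejuvenation_keeps_up(rejuvenation_rate: float, intrusion_rate: float) -> bool:
    """Rejuvenation is effective only if it is faster than intrusion."""
    return rejuvenation_rate > intrusion_rate
```

For example, with f = 2, one compromised replica and one rejuvenation in
progress, the system is still available but no further rejuvenation can be
started until one of the two completes.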

Rejuvenation has a greater impact on the security of the system if it is
combined with diversity (e.g., restoring the replica with a clean version of
a different OS). The recovery process may eliminate the impacts of a fault
or intrusion, but there is no guarantee that the rejuvenated replica will
not be compromised again through the same vulnerability. The situation is
aggravated if the adversary has gained critical information (e.g.,
passwords, OS version) prior to the rejuvenation, which may result in a
more sophisticated form of attack following the recovery. Rejuvenation can
take two different forms, as follows:

1. Proactive rejuvenation: Proactive recovery is the process of periodically
rejuvenating the replicas. It assists in identifying dormant faults (which
may not even be detected by detection mechanisms) or masking intrusions,
and it should be conducted frequently enough to restrain the adversary from
infecting more than f replicas during a proactive recovery period (assuming
that no reactive recovery is performed in this period). The downside of
this method is that it may not be effective in an asynchronous system,
since the attacker can postpone the recovery of a compromised replica by
manipulating the system's clock. As a result, the attacker will have
adequate time to compromise more replicas than the system is able to
tolerate [21]. Another possibility is that the attacker may be able to
intrude into the components at a rate faster than rejuvenation [42].

2. Reactive rejuvenation: This kind of rejuvenation complements the proac- 
tive recovery by speeding up the process of handling compromised repli- 
cas. It is usually triggered by intrusion detection mechanisms to rejuve- 
nate the suspected and faulty replicas. If a compromised replica cannot 
be identified by the adopted detection mechanisms in the system, there 
is no way to signal the reactive recovery to be performed. Hence, the 
intrusion will go undetected without raising any suspicion [38]. 
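
The interplay of the two forms, together with the diversity-on-recovery
idea discussed earlier in this section, can be sketched as a small
selection routine. This is a minimal Python illustration under stated
assumptions, not the scheduler of any cited ITS: the OS pool, the Replica
fields and the policy of serving suspected replicas first are all
hypothetical, and the `suspected` flag is assumed to be set by an external
intrusion detector.

```python
from dataclasses import dataclass

OS_POOL = ["os_a", "os_b", "os_c"]  # hypothetical pool of diverse OS images


@dataclass
class Replica:
    name: str
    os_image: str
    last_rejuvenated: int = 0  # time step of the last rejuvenation
    suspected: bool = False    # assumed to be set by an intrusion detector


def select_for_rejuvenation(replicas, now, period, f):
    """Pick at most f replicas to rejuvenate at time `now`.

    Suspected replicas (reactive rejuvenation) are served first, followed by
    replicas whose proactive period has elapsed, oldest first. The cap f
    reflects the upper bound on concurrent rejuvenations.
    """
    reactive = [r for r in replicas if r.suspected]
    proactive = sorted(
        (r for r in replicas
         if not r.suspected and now - r.last_rejuvenated >= period),
        key=lambda r: r.last_rejuvenated,
    )
    return (reactive + proactive)[:f]


def rejuvenate(replica, now):
    """Restore the replica with a different OS image and reset its state."""
    idx = OS_POOL.index(replica.os_image)
    replica.os_image = OS_POOL[(idx + 1) % len(OS_POOL)]
    replica.suspected = False
    replica.last_rejuvenated = now
```

Rotating through the pool is only one possible way to combine rejuvenation
with diversity; a random choice among the remaining images would serve the
same purpose.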

3.3.4. Voting 

Voting algorithms are employed to reach a consensus on the valid and final
output of non-faulty redundant components in an ITS. Criteria such as edit
distance (e.g., Hamming distance) and hash codes make the comparison
feasible. Voting contributes to masking and tolerating intrusions.
Formalized majority voting and formalized plurality voting are some of these
algorithms [39].
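
To make the comparison step concrete, the following Python sketch shows a
plain (not formalized) majority vote over replica outputs compared by hash
code, along with a bitwise Hamming distance. It is an illustration only;
the choice of SHA-256 and the function names are our assumptions.

```python
import hashlib
from collections import Counter
from typing import Optional, Sequence


def digest(output: bytes) -> str:
    """Hash code used to compare replica outputs."""
    return hashlib.sha256(output).hexdigest()


def majority_vote(outputs: Sequence[bytes]) -> Optional[bytes]:
    """Return the output produced by a strict majority of replicas, else None."""
    if not outputs:
        return None
    counts = Counter(digest(o) for o in outputs)
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(outputs):
        return None  # no strict majority, so no agreed output
    return next(o for o in outputs if digest(o) == winner)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length outputs."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

With three replicas, a single corrupted output is masked because the two
matching outputs still form a strict majority.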

3.3.5. Secret sharing scheme 

Table 1. Proposed comparative analysis of intrusion tolerant architectures.
(Y: Yes, N: No, O: Optional)

Paradigm                          COCA  DIT  Willow  SITAR  SCIT  MAFTIA  Crutial  FOREVER  Generic ITS for web servers
Replication                       Y     Y    Y       Y      Y     Y       Y        Y        Y
Diversity                         N     Y    Y       Y            Y       Y        Y        Y
Proactive Recovery                Y     Y    Y       N      Y     N       Y        Y        Y
Reactive Recovery                 N     Y    Y       Y      N     Y       Y        Y        Y
Fine-grained Recovery             N     N    N       N      N     N       N        N        N
Voting/BFT Agreement              Y     Y    N       Y      N     Y       Y        Y        N
Proxy                             N     Y    N       Y      N     N       N        N        Y
Intrusion Detection Capabilities  Y     Y    Y       Y      N     Y       Y        Y        Y
Secret Sharing                    Y     N    N       N      N     Y       N        N        N

Secret sharing, or a threshold scheme, is based on the idea of concealing a
piece of information by splitting it into several shares and distributing
them among participants in such a manner that specific subsets of the shares
are required to rebuild the initial data [44]. With regard to application in
ITSs, the secret data can be the main information or the associated
cryptographic key. The former entails storing the data shares in separate
physical locations in a way that confidentiality is maintained and the
original information can be rebuilt even if a certain number of shares are
infected or compromised by attackers. In the second case, the key used to
encrypt the data is broken down into shares so that a particular number of
shares are needed to reconstruct the original key and access the data [39].
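
As an illustration of the threshold idea, the sketch below implements
Shamir's (k, n) scheme over a prime field, in which any k of the n shares
rebuild the secret. Shamir's scheme is used here only as one well-known
instance; the text above does not prescribe a particular scheme, and the
field size and function names are assumptions of this sketch. It is not
hardened for production use.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the field size is an assumption


def split_secret(secret: int, k: int, n: int):
    """Split `secret` into n shares so that any k of them rebuild it."""
    if not (0 <= secret < PRIME and 1 <= k <= n):
        raise ValueError("invalid parameters")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Rebuild the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

In the second use case described above, `secret` would be the encryption
key interpreted as an integer, and the shares would be stored on separate
replicas or in separate physical locations.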

3.3.6. Acceptance testing 

Acceptance testing is a programmer- or developer-provided error detection
function in a software module that inspects the reasonableness of the
generated results. It takes different forms, including requirement tests,
reasonableness tests, timing tests, accounting tests and coding tests. Like
redundancy and diversity, this technique has its roots in fault tolerance.
However, being application dependent is regarded as one of its drawbacks:
creating appropriate and effective tests necessitates a thorough
understanding of the system. More details on acceptance testing measures
can be found in [45].
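
Two of the listed forms, the reasonableness test and the timing test, can
be sketched in a few lines of Python. The example is hypothetical: the
frequency bounds and the deadline are placeholder values chosen for
illustration, not figures taken from the cited work, which shows the
application dependence noted above.

```python
import time
from typing import Callable, Tuple


def reasonableness_test(value: float, low: float, high: float) -> bool:
    """Accept a result only if it lies within a plausible range."""
    return low <= value <= high


def timing_test(func: Callable[[], float], deadline_s: float) -> Tuple[float, bool]:
    """Run func and report its result and whether it met the deadline."""
    start = time.monotonic()
    result = func()
    elapsed = time.monotonic() - start
    return result, elapsed <= deadline_s


def accept_frequency_reading(read_sensor: Callable[[], float]) -> bool:
    """Hypothetical acceptance test for a grid frequency reading in Hz."""
    value, on_time = timing_test(read_sensor, deadline_s=0.5)  # placeholder deadline
    return on_time and reasonableness_test(value, low=45.0, high=65.0)  # placeholder bounds
```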

3.3.7. Indirection 

Proxies, wrappers and virtualization are some of the indirection tech- 
niques that serve as additional layers of defense between servers and clients. 
In spite of their benefits, they incur the cost of overhead and latency so these 
factors should be taken into account when designing the system[39]. 

3.4. Intrusion Tolerant Architectures

During the last decade, a considerable body of research has been conducted
on intrusion tolerance, and multiple intrusion tolerant architectures with
specific features and applications have been proposed. The Willow
architecture [46], COCA [47], DIT [48], MAFTIA [49], SITAR [50], SCIT [24],
Crutial [9], FOREVER [43] and the Generic intrusion tolerant architecture
for web servers [25] exemplify the proposed ITS architectures. Some of these
architectures are application-specific. For instance, the goal of COCA is to
provide a secure and fault-tolerant certification authority (CA), while
Crutial is a distributed firewall-like intrusion tolerant system for the
protection of critical infrastructures such as the power grid. Primarily,
however, the need to enhance the security and availability of distributed
services, Commercial Off The Shelf (COTS) servers and critical information
systems has driven the design of such architectures. The intrusion
tolerance paradigms introduced in the previous section are commonly used in
intrusion tolerant systems. Hence, they can be utilized to analyze and
compare different intrusion tolerant architectures. Some representative ITS
architectures have been compared qualitatively in [17, 18]. Table 1
presents a similar analysis, but with emphasis on the paradigms of intrusion
tolerance employed in several ITSs. The architectures have distinct
features, and we have developed this comparative analysis to give a clear
picture of their respective attributes. Moreover, our comparison covers a
larger number of ITSs. As can be seen, replication and diversity are the
techniques adopted by almost all the ITSs. Although design diversity (e.g.,
using different versions of an OS) is the dominant type of diversity used by
the ITSs, FOREVER and Crutial can employ time diversity (i.e., rejuvenation
introduces diversity). Some ITSs such as FOREVER, Willow and Crutial apply a
combined recovery method, that is, both proactive and reactive recovery,
whereas others like Self Cleansing Intrusion Tolerance (SCIT) only use
proactive recovery. To the best of our knowledge, none of the existing ITSs
utilizes the fine-grained recovery strategy introduced in [38]. One widely
preferred indirection technique is the use of proxies as mediators between
the COTS servers and the outside network. Intrusion detection, whether
anomaly-based or signature-based, is very common among the ITSs. Byzantine
agreement algorithms and secret sharing are the other intrusion tolerant
mechanisms that have been implemented in some of the architectures. Among
the ITSs shown in Table 1, SCIT and Scalable Intrusion Tolerant Architecture
for Distributed Services (SITAR) have drawn more attention in published
intrusion tolerance research and have been investigated with regard to their
performance [40, 41, 42, 43].

One of the formidable challenges that ITSs must handle is their own
security against malicious manipulations and attacks. Several
methods/modules are therefore employed in the ITSs to meet this challenge,
some of which are as follows:

• Sensor subsystem, runtime verification and private subnet between the 
proxy and other components in DIT architecture 

• Audit control in SITAR 

• One-way signal from controller to the servers in SCIT 

• Distributed trust throughout the system in MAFTIA 

• Wormhole in Crutial and FOREVER 

• Runtime verification in the Generic ITS for web servers 

Other important issues that should be addressed include complexity,
performance and cost. For instance, the relative complexity of SITAR is high
whereas that of SCIT is low [17]. Desired performance metrics are chosen
with regard to the application. In the case of the smart grid, increased
complexity of the ITS may be an advantage, since it makes it more difficult
for attackers to break into the system. However, the complexity of a
proposed ITS architecture for the smart grid should not considerably degrade
its desired performance. For instance, delay is of paramount importance for
communications from control centers to substations [35]. In addition, the
degrees of redundancy and diversity are policy-dependent and should be set
at deployment time.

4. Proposed ITS Architecture for Smart Grid Control Centers 

Typical ITSs have a single primary focus: SITAR, for example, is
detection-triggered, and SCIT is recovery-based. Moreover, as shown in
Table 1, none of the existing ITS architectures employs the fine-grained
rejuvenation approach. These issues, along with the specific requirements of
smart grid control centers (e.g., delay sensitivity), underscore the need
for a new ITS architecture that suits the smart grid control centers. The
proposed ITS for SCADA systems in the smart grid combines a wide spectrum of
intrusion tolerance techniques. As illustrated in Figure 4, the proposed
system comprises five modules, namely the replication & diversity module,
the auditing module, the compromised/faulty replica detector module, the
reconfiguration module and the proxy module. The role and working principles
of these modules are elucidated in the following sections. It should be
noted that, to prevent the proposed ITS from being compromised by intruders,
all the components' tasks and their communications are assumed to be
performed in a trusted platform. The proxy module also helps to enhance the
security of the ITS.

Figure 4. The proposed ITS Architecture.

4-1. Replication and Diversity Module 

The replication module consists of a number of replicas of a critical
entity in the SCADA systems of the smart grid, such as the Master Terminal
Unit (MTU). The number of replicas should be at least 2f + 1 to tolerate f
intrusions. In this module, the number of replicas is assumed to be
2f + 1 + k, and the values of f (f ≥ 1) and k are set at deployment time.
The same approach was also used to design a distributed firewall-like
protection device named Crutial Information Switch (CIS) in [9]. The reason
why k is added to the number of replicas is given in the reconfiguration
module section. It should be noted that all replicas have OS diversity to
decrease the probability of sharing the same vulnerabilities. The OS is
considered a vital element of each replica because it hosts the SCADA
system. Thus, any misconfiguration or vulnerability in the OS may bring down
the SCADA system and allow adversaries to achieve breakthroughs [34].
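
As a small illustration of the replica budget described above, the following Python sketch computes the total number of replicas and checks that at least 2f + 1 replicas stay active while some are offline for recovery. The function names are hypothetical and are not part of the proposed architecture.

```python
def total_replicas(f: int, k: int) -> int:
    """Total replicas n = 2f + 1 + k for f tolerated intrusions and k spares."""
    if f < 1 or k < 0:
        raise ValueError("expected f >= 1 and k >= 0")
    return 2 * f + 1 + k


def active_quorum_remains(f: int, k: int, recovering: int) -> bool:
    """True if at least 2f + 1 replicas stay active while `recovering` are offline."""
    if recovering > k:
        return False  # more concurrent recoveries than the k spares can absorb
    return total_replicas(f, k) - recovering >= 2 * f + 1
```

For example, with f = 1 and k = 1 the module deploys four replicas, and taking one of them offline still leaves three active replicas.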

4-2. Compromised/Faulty Replica Detector

This module aims at examining the responses/outputs of the replicas to 
identify possible infected/compromised ones. It is composed of the following 
two sub modules: 

1. Inspector: The inspector performs acceptance testing, an intrusion
tolerance technique. It involves application-specific checks with regard
to the security policy to ensure the sanity of outgoing data from the
replicas. Any symptom of security compromise that it detects triggers the
reactive recovery sub module in the reconfiguration module.

2. Voting: This sub module is intended to mask the impacts of intrusions
and to ensure the integrity of the replicas' outputs. Based on a voting
algorithm, it seeks the correct output by comparing the redundant outputs
from the active replicas that have passed the inspector. In this way, it
arrives at a consensus on the final output to be passed to the proxy
module. This output can be a command or information from the SCADA
critical components destined for a particular field device in the smart
grid. Invalid outputs trigger the reactive recovery sub module of the
reconfiguration module for their corresponding replicas.
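
The interplay of the two sub modules can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the voting algorithm of the proposed ITS: the acceptance test is passed in as a hypothetical callable, and an output is accepted only if at least f + 1 replicas agree on it (a threshold assumed here so that at least one correct replica backs the result).

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Optional, Tuple


def detect_and_vote(
    outputs: Dict[str, Hashable],
    acceptance_test: Callable[[Hashable], bool],
    f: int,
) -> Tuple[Optional[Hashable], List[str]]:
    """Return (agreed output or None, replicas to send to reactive recovery)."""
    # Inspector: outputs failing the application-specific check are flagged.
    passed = {r: out for r, out in outputs.items() if acceptance_test(out)}
    suspects = [r for r in outputs if r not in passed]
    if not passed:
        return None, sorted(suspects)

    # Voting: assumed rule, accept a value only if f + 1 replicas agree on it.
    winner, votes = Counter(passed.values()).most_common(1)[0]
    if votes < f + 1:
        return None, sorted(suspects)  # no sufficiently supported output
    # Replicas that passed the inspector but disagree with the vote are suspects too.
    suspects += [r for r, out in passed.items() if out != winner]
    return winner, sorted(suspects)
```

The returned list of suspect replicas corresponds to the replicas for which reactive recovery would be triggered; the agreed output is what would be forwarded to the proxy module.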

4-3. Reconfiguration Module 

The reconfiguration module consists of two sub modules, namely automatic
rejuvenation and manual restoration. When the proposed ITS is able to mask
an intrusion, it relies on the automatic rejuvenation sub module; otherwise,
it resorts to manual restoration, which involves human intervention. Manual
restoration takes place when, for instance, the system is targeted by DoS
attacks and is only capable of provisioning the essential services (graceful
degradation). The sub modules are described as follows:

1. Automatic rejuvenation: In this module, a combined rejuvenation approach
(i.e., proactive-reactive recovery) is used to compensate for the
shortcomings of the two individual rejuvenation approaches, i.e., reactive
and proactive recovery. It thus enhances the performance of the system by
decreasing the possible duration of time during which a compromised replica
may disrupt the normal operation of the system [21]. The automatic
rejuvenation module enables the concurrent rejuvenation of at most k
replicas out of 2f + 1 + k (the total number of replicas). This choice of
the total number of replicas eliminates the impact of compromised replicas
(at most f) and of recovery on the availability of the system. Proactive
recovery (at the system level) is performed periodically by choosing the
active replica with the smallest rejuvenation timestamp. Note that at most
one replica is allowed to undergo this type of recovery at a time. Figure 5
shows the proactive rejuvenation mechanism. Reactive rejuvenation
complements the proactive recovery. For reactive recovery, we have been
inspired by a hierarchical recovery method recently proposed in [38]. It
encompasses three levels of recovery granularity, namely system level,
object level and process level recovery. This model eliminates the need for
complete recovery when the system is only partly compromised. Its merits are
reduced total recovery time and improved flexibility and dependability. In
our system, reactive recovery can be triggered externally, at the system
level, by the compromised/faulty replica detector module (and may introduce
diversity), or internally (within a replica) in a hierarchical and
fine-grained fashion (including process level recovery and system level
recovery), much like the strategy proposed in [38]. The maximum number of
replicas that can be under system level reactive recovery is k. Figure 6
depicts process level recovery. The process manager (which can act as a type
of host-based IDS with self-healing capabilities) is a module executed in
each active replica to handle the process level recovery. There are two sets
of processes, namely the active set (which includes the running processes)
and the standby set. Based on a timeout period, the process manager examines
the pool of active processes. If it finds a suspected process, it obtains
the relevant checkpoint, kills the process and activates its peer from the
standby set (if there is any); otherwise, system level recovery is
performed. Process level recovery saves time compared to system level
recovery, and it is more secure since it only involves information and
communication exchange inside a machine. Moreover, it does not require the
replica to go offline to perform the recovery.
2. Manual restoration: This sub module is triggered when the intrusion
(whether detected or not) is non-maskable (e.g., more than f replicas have
been compromised). This may cause the system to be in graceful degradation
mode, stopped functioning mode or complete failure mode, all of which
require human intervention and corrective measures to return the system to
the normal working state.

procedure proactive-rejuvenation()
    Wait(RejuvenationPeriod);
    while NoConcurrentRejuvenations ≥ k do
        Wait(RejuvenationPeriod);
    end while
    i ← Find(ReplicaWithSmallestTimestamp);
    Replica[i].status ← recovery;
    NoConcurrentRejuvenations++;
    Replica[i].SystemLevelRecovery();
    Replica[i].SetTimestamp();
    Replica[i].status ← active;
    NoConcurrentRejuvenations--;
end procedure

Figure 5. Proactive recovery mechanism.
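
The proactive and process level recovery procedures (Figures 5 and 6) can be sketched in Python as follows. This is a simplified, sequential stand-in under stated assumptions: the class and method names are hypothetical, timestamps are a logical counter rather than wall-clock time, the waiting and polling steps are omitted, and the actual restoration of a clean image is left as a comment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Replica:
    name: str
    timestamp: int = 0          # logical time of the last system level recovery
    status: str = "active"      # "active" or "recovery"
    standby: Set[str] = field(default_factory=set)  # processes with a standby peer


class Rejuvenator:
    """Hypothetical coordinator for proactive and fine-grained reactive recovery."""

    def __init__(self, replicas: List[Replica], k: int) -> None:
        self.replicas = replicas
        self.k = k
        self.concurrent = 0     # replicas currently under system level recovery
        self.clock = max((r.timestamp for r in replicas), default=0)

    def system_level_recovery(self, replica: Replica) -> bool:
        """Recover one replica if fewer than k recoveries are in progress."""
        if self.concurrent >= self.k:
            return False
        replica.status = "recovery"
        self.concurrent += 1
        # A real system would restore a clean (possibly diversified) image here.
        self.clock += 1
        replica.timestamp = self.clock
        replica.status = "active"
        self.concurrent -= 1
        return True

    def proactive_round(self) -> Optional[Replica]:
        """Rejuvenate the active replica with the smallest timestamp."""
        active = [r for r in self.replicas if r.status == "active"]
        if not active:
            return None
        oldest = min(active, key=lambda r: r.timestamp)
        return oldest if self.system_level_recovery(oldest) else None

    def process_level_recovery(self, replica: Replica, suspects: List[str]) -> List[str]:
        """Swap suspected processes for standby peers, else fall back to system level."""
        actions = []
        for proc in suspects:
            if proc in replica.standby:
                replica.standby.discard(proc)   # checkpoint, kill, activate the peer
                actions.append(f"swap:{proc}")
            elif self.system_level_recovery(replica):
                actions.append("system-recovery")
                break                           # the whole replica was just restored
        return actions
```

In a deployment the recoveries would run concurrently, which is when the limit of k simultaneous system level recoveries becomes relevant; in this sequential sketch the counter only documents where that check sits.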

4-4. Auditing Module

This module maintains audit logs for all modules. The logs are useful for
the security administrator to monitor and analyze the operation of the
system.

procedure process-manager()
    while not timeout do
        Wait();
    end while
    Polling();
    for all the suspected processes (j) in replica i do
        if Process[j].StandbyAvailable() then
            Process[j].ObtainCheckpoint(Suspect);
            Process[j].Kill(Suspect);
            Process[j].ActivateStandby();
        else
            if NoConcurrentRejuvenations < k then
                Replica[i].status ← recovery;
                NoConcurrentRejuvenations++;
                Replica[i].SystemLevelRecovery();
                Replica[i].SetTimestamp();
                Replica[i].status ← active;
                NoConcurrentRejuvenations--;
                Exit the for loop
            end if
        end if
    end for
    timeout = False;
end procedure

Figure 6. Process level recovery in a replica.

4-5. Proxy Module

The proxy module is placed on the boundary of the ITS architecture, where
data comes into or goes out of the intrusion tolerant architecture. The
proxy module shields the internal structure of the ITS from attackers and
also acts as a load balancer. When the state information of field devices or
the power usage data collected by smart meters (incoming data in Figure 4),
gathered in field devices, is passed to the respective critical components
in the control center, it goes through the proxy module as the first layer
of defense. This incoming data is then forwarded to the replication and
diversity module to be processed. Moreover, the control commands from the
SCADA system (outgoing data in Figure 4) pass through the proxy to reach the
field devices in substations. The proxy module is composed of several
proxies located in different virtual machines that have diversity in their
operating systems and are managed by a controller. Proxies can be in one of
three modes, namely online, offline, and cleansing. The number of online
proxies can be one or more, based on the decision of the controller.
Depending on a defined exposure time for proxies and a round-robin
algorithm, the controller handles the rotation between proxies [24]. When
the exposure time of a proxy is reached, it goes through the rejuvenation
(or cleansing) process and is in cleansing mode. Its mode is then changed to
offline, and it is ready to be chosen by the controller to go online.
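
The rotation of proxies through the three modes can be sketched in Python as follows. The sketch makes several assumptions that the text does not fix: time advances in discrete ticks, cleansing takes exactly one tick, and there are at least as many spare proxies as online ones so that a replacement is always available. The class name `ProxyController` and its methods are hypothetical.

```python
from typing import Dict, List


class ProxyController:
    """Round-robin rotation of proxies: online -> cleansing -> offline -> online."""

    def __init__(self, names: List[str], online_count: int, exposure_time: int) -> None:
        if online_count < 1 or 2 * online_count > len(names):
            raise ValueError("need at least as many spare proxies as online proxies")
        self.exposure_time = exposure_time
        self.mode: Dict[str, str] = {n: "offline" for n in names}
        self.exposed: Dict[str, int] = {n: 0 for n in names}
        self.queue: List[str] = list(names)   # round-robin order of offline proxies
        for _ in range(online_count):
            self._bring_online()

    def _bring_online(self) -> None:
        name = self.queue.pop(0)
        self.mode[name] = "online"
        self.exposed[name] = 0

    def tick(self) -> None:
        """Advance one time unit and rotate proxies whose exposure time is reached."""
        # Proxies cleansed during the previous tick become available again.
        for name in list(self.mode):
            if self.mode[name] == "cleansing":
                self.mode[name] = "offline"
                self.queue.append(name)
        # Only proxies that were online at the start of this tick accumulate exposure.
        for name in [n for n, m in self.mode.items() if m == "online"]:
            self.exposed[name] += 1
            if self.exposed[name] >= self.exposure_time:
                self.mode[name] = "cleansing"
                self._bring_online()          # keep the number of online proxies fixed
```

With three proxies, one online proxy and an exposure time of two ticks, the first proxy is sent to cleansing on the second tick while the second proxy takes over.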

4-6. An Attack Scenario 

The working principle of the proposed ITS architecture can be described
through an attack scenario. Suppose that an attacker (an outsider or a
malicious insider) has gained access to the SCADA system in the smart grid
and tries to infect one or more replicas of a critical component (the number
of compromised replicas is less than or equal to f). The attacker would thus
be able to issue control commands. It is also possible that the adversary
makes the replica malfunction (e.g., by running a Trojan or changing some
system files), which may result in sending inappropriate commands (in case
of automatic operation). However, the command must first pass the
compromised/faulty replica detector. It is highly probable that the
compromised replica(s) will be recognized by the inspector (using its
detection capabilities) or by voting (since the replicas have different
operating systems, they may not all be infected by the same attack targeted
at a specific type of vulnerability, and thus the generated responses would
differ), so the command will not go further and the infected replicas will
undergo reactive recovery. In addition, the process manager running in each
replica may detect the infection and trigger the process level rejuvenation.
Even if the intrusion tolerance mechanisms fail to detect the intrusion, it
is possible that the attack's impact is masked through proactive recovery.

5. Performance Analysis of the Proposed ITS Architecture

Security quantification of the proposed ITS architecture is needed to
assess the desired performance measures and to compare its performance with
that of other architectures. To achieve this goal, a state-space model is
developed that incorporates an attacker's behavior along with the system's
response to an attack or intrusion [19, 51, 52, 53]. State transition
diagrams assist in the evaluation of the transitions impacted by the
inter-domain dependencies in cyber-physical systems. They describe how the
attacker's actions cause transitions to failure states [10]. The main
advantage of state transition models is the ability to provide a
fine-grained system description which includes the dynamic behavior of the
system [54]. Moreover, these models are suited to modeling large and complex
systems such as the smart grid. Markov chains are the basis for diverse
state-space techniques in dependability analysis [33]. Markov models have
often been adopted for software and hardware performance and dependability
evaluation because of their capability to capture a variety of dependencies
and the simplicity of computing steady-state, transient, and cumulative
transient measures. A Semi-Markov Process (SMP) is a generalization of both
continuous and discrete time Markov chains which allows arbitrary state
holding time distribution functions, possibly depending on both the current
state and the state to be visited next [55].

Availability and reliability analysis of smart grid control center
networks using Stochastic Petri Nets (a kind of state-space model) has been
provided in [33]. In this paper, we focus on the security analysis of our
proposed ITS for smart grid control centers with regard to availability and
Mean Time To Security Failure (MTTSF) as performance measures, using a semi
Markov model. The analytical evaluation has been carried out using MATLAB.

5.1. System Model 

The derived state transition diagram is shown in Figure 7 and serves as a
generic model for analyzing the behavior of various ITS architectures,
including the proposed architecture. It incorporates the different
security-related states of the ITS and their interrelationships. Table 2
presents these security states and their corresponding descriptions. The
system changes from one state to another during its functional lifespan as a
result of normal usage, abuse, maintenance and corrective measures,
failures, and so on.

Figure 7. State transition diagram for the proposed ITS.

Table 2. Different states of the system and their respective descriptions.

State   Description
G       Good
V       Vulnerable
I       Intruded
DMC     Detected Masked Compromised
UMC     Undetected Masked Compromised
UNC     Undetected Not masked Compromised
DNC     Detected Not masked Compromised
GD      Graceful Degradation
FS      Fail-secure
F       Failed

Therefore, the behavior of the system is portrayed as transitions between
the states, and each transition corresponds to a specific event. Since the
interval between transitions from one state to another (i.e., the state
holding time or inter-event time) is random, its underlying process is
defined as a stochastic process [54]. In our system, this process is
associated with arbitrary probability distributions; thus, it can be modeled
using an SMP. We can classify the possible transitions in Figure 7 according
to their starting states as follows:

1. Transition from the state G: A system free of vulnerabilities is
regarded as being in the good state G. While probing and scanning the
system, an adversary may identify vulnerabilities that make it possible to
evade or overcome prevention and detection mechanisms and violate the
system's security policy. As a result, the system state changes from the
good state G to the vulnerable state V. Even if the system only possesses
potential vulnerabilities that could be abused with malicious intent, it can
be regarded as being in the vulnerable state.

2. Transitions from the state V: Discovering a vulnerability before an
intrusion occurs and subsequently fixing it brings the system from the
state V back into the state G. The other possible transition follows the
successful exploitation of a vulnerability and leads to the intruded
state I.

3. Transitions from the state I: There are four feasible transitions to
other states from the state I. First, if the intrusion tolerance techniques
employed in the ITS fail to detect the intrusion and to mitigate the impacts
of the attack (i.e., mask the attack's impacts), the system goes to the
state UNC with no service guarantee. Second, if the intrusion is detected
and the intrusion tolerance techniques succeed in masking the attack's
impact, the state of the system changes from I to DMC. In this state, the
intrusion is handled by the faulty/compromised replica detector and
rejuvenation modules. Third, if the intrusion goes undetected but is masked
through proactive recovery, a transition to the state UMC is made.
Subsequently, restoration without any service degradation enables reaching
the state G from the states UMC or DMC. This is where our state diagram
differs from [51], in which the ITS architecture (i.e., SITAR) did not
possess proactive recovery (corresponding to the state UMC in our system).
More specifically, the audit module in SITAR carries out periodic diagnosis
tests to verify the correct operation of other components and forwards the
results to the adaptive reconfiguration module to take appropriate actions
[50], which may include some type of reactive recovery. The last possible
transition is to the state DNC, when an intrusion is identified but the
containment of the damage fails.

4. Transitions from the state DNC: It is possible for an attacker to
compromise more than f replicas (e.g., in case of a DoS attack). This may
result in complete system failure (transition from DNC to F) or in entering
the states GD or FS. In the state GD, the system is only able to provide
essential services, which might be defined differently in various systems,
whereas in the state FS, the system stops functioning.

5. Transitions from F, GD, FS and UNC: The endpoint of all these transi- 
tions will be the state G. The aforementioned transitions would involve 
manual restoration and corrective maintenance. 

5.2. SMP Analysis 

An SMP can be studied through its embedded discrete time Markov chain,
which requires two sets of parameters [51, 53]:

1. mean sojourn time (i.e., state holding time) for each state 

2. the transition probabilities between different states 

With respect to Figure 7, the Discrete Time Semi Markov Model (DTSMM)
possesses a discrete state space $X_s$ = {G, V, I, UMC, DMC, DNC, UNC, GD,
FS, F}, for which $h_i$ indicates the mean sojourn time in state
$i \in X_s$ and $p_{ij}$ represents the transition probability between
states $i, j \in X_s$.

5.3. Availability Formulation and Analysis 

Availability and service continuity, as the most vital security attribute
of the smart grid, need to be analyzed and evaluated for the proposed ITS
architecture. An SMP model and the steady-state probabilities of its states
support the steady-state availability analysis of the proposed ITS. We
analyze the sensitivity of the availability with respect to two parameters,
namely the probability of intrusion ($p_I$) and the mean time to resist
becoming vulnerable to intrusions ($h_G$). In addition, the proposed ITS is
compared with two well-known existing ITSs, namely SITAR and SCIT, using
these parameters.

The steady-state availability A is defined as the probability that the
system is in one of the normal functioning states. One way to determine the
availability is to identify the unavailable states (i.e., states FS, F and
UNC). Thus, the steady-state availability A can be formulated as,



$$A = 1 - (\pi_{UNC} + \pi_{FS} + \pi_{F}) \tag{1}$$

where $\pi_i$, $i \in \{UNC, FS, F\}$, denotes the steady-state probability
of being in state $i$ for the SMP, which can be computed as,

$$\pi_i = \frac{v_i h_i}{\sum_{j} v_j h_j}, \quad i, j \in X_s \tag{2}$$

where $h_i$ indicates the mean state holding time in state $i$ and $v_i$
denotes the embedded Discrete Time Markov chain (DTMC) steady-state
probability in state $i$. We can derive the $v_i$s from the following two
equations,

$$v = v \cdot P \tag{3}$$

$$\sum_{i} v_i = 1, \quad i \in X_s \tag{4}$$



where $P$ is the transition probability matrix of the corresponding DTMC
for the proposed ITS.

In this research, the mean state holding times $h_i$ for all the states of
the DTMC have been assumed to have the same values as in [51], except for
the state UMC, which is a new state (corresponding to proactive recovery)
for our proposed ITS.
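
To make equations (1)-(4) concrete, the following Python sketch builds an embedded DTMC from the transitions described in Section 5.1, solves $v = v \cdot P$ with the normalization (4), and evaluates the availability. It is an illustration under stated assumptions: the G to V transition is taken to have probability one, and all parameter values are hypothetical and are not the values from [51] used in the paper.

```python
from typing import Dict, List

STATES = ["G", "V", "I", "DMC", "UMC", "UNC", "DNC", "GD", "FS", "F"]

# Hypothetical parameter values, for illustration only.
p = {"I": 0.4, "DM": 0.5, "UM": 0.2, "UN": 0.1, "DN": 0.2,
     "GD": 0.5, "FS": 0.3, "F": 0.2}
h = {"G": 10.0, "V": 5.0, "I": 2.0, "DMC": 1.0, "UMC": 1.0,
     "UNC": 4.0, "DNC": 1.0, "GD": 3.0, "FS": 3.0, "F": 6.0}


def transition_matrix(p: Dict[str, float]) -> List[List[float]]:
    """Embedded DTMC assumed from the transition list of Section 5.1."""
    idx = {s: i for i, s in enumerate(STATES)}
    P = [[0.0] * len(STATES) for _ in STATES]
    edges = [("G", "V", 1.0), ("V", "I", p["I"]), ("V", "G", 1.0 - p["I"]),
             ("I", "DMC", p["DM"]), ("I", "UMC", p["UM"]),
             ("I", "UNC", p["UN"]), ("I", "DNC", p["DN"]),
             ("DNC", "GD", p["GD"]), ("DNC", "FS", p["FS"]), ("DNC", "F", p["F"])]
    edges += [(s, "G", 1.0) for s in ("DMC", "UMC", "UNC", "GD", "FS", "F")]
    for src, dst, prob in edges:
        P[idx[src]][idx[dst]] = prob
    return P


def stationary(P: List[List[float]]) -> List[float]:
    """Solve v = v P with sum(v) = 1 by Gauss-Jordan elimination."""
    n = len(P)
    # Rows of (P^T - I) v = 0; the last row is replaced by the normalization (4).
    A = [[P[j][i] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(n)]


def availability(p: Dict[str, float], h: Dict[str, float]) -> float:
    v = dict(zip(STATES, stationary(transition_matrix(p))))
    total = sum(v[s] * h[s] for s in STATES)
    pi = {s: v[s] * h[s] / total for s in STATES}   # equation (2)
    return 1.0 - (pi["UNC"] + pi["FS"] + pi["F"])   # equation (1)
```

Solving the linear system directly avoids convergence concerns that a simple power iteration could have on a chain with strongly cyclic structure.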

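Finally, by using (1)-(4), the steady-state availability ($A_P$) of our
proposed ITS is computed as,

$$A_P = 1 - \frac{h_{UNC} p_I p_{UN} + h_F p_I p_{DN} p_F + h_{FS} p_I p_{DN} p_{FS}}{h_G + h_V + p_I (h_I + h_{DMC} p_{DM} + h_{UNC} p_{UN} + h_{UMC} p_{UM} + h_{DNC} p_{DN} + h_{GD} p_{DN} p_{GD} + h_{FS} p_{DN} p_{FS} + h_F p_{DN} p_F)} \tag{5}$$

In a similar manner, the steady-state availability for SITAR
($A_{SITAR}$) and SCIT ($A_{SCIT}$) are derived as,

$$A_{SITAR} = 1 - \frac{h_{UNC} p_I p_{UN} + h_F p_I p_{DN} p_F + h_{FS} p_I p_{DN} p_{FS}}{h_G + h_V + p_I (h_I + h_{DMC} p_{DM} + h_{UNC} p_{UN} + h_{DNC} p_{DN} + h_{GD} p_{DN} p_{GD} + h_{FS} p_{DN} p_{FS} + h_F p_{DN} p_F)} \tag{6}$$

$$A_{SCIT} = 1 - \frac{h_F p_I p_F}{h_G + h_V + p_I (h_I + h_{UMC} p_{UM} + h_F p_F)} \tag{7}$$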

It should be pointed out that some of the transition probabilities may have
different values, or may not even be applicable, for the three ITSs, because
the three ITSs do not possess the same state space (the DTSMM state space
for SITAR does not include the state UMC, whereas that of SCIT does not
contain the states DMC, DNC, UNC, GD and FS).
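
The closed-form availability expressions for the three ITSs share one structure, which the following Python sketch makes explicit. It assumes the reading of (5)-(7) in which the unavailable states are UNC, FS and F and each state reached after an intrusion is weighted by its path probability from state I. The helper names and all numerical values are hypothetical; in particular, the values are not those used for Figures 8 and 9.

```python
from typing import Dict

UNAVAILABLE = {"UNC", "FS", "F"}


def availability(h: Dict[str, float], p_i: float, reach: Dict[str, float]) -> float:
    """Closed-form steady-state availability.

    reach[s] is the probability that state s is visited after one intrusion,
    e.g. p_DN * p_GD for state GD.
    """
    after_intrusion = h["I"] + sum(w * h[s] for s, w in reach.items())
    down = sum(w * h[s] for s, w in reach.items() if s in UNAVAILABLE)
    return 1.0 - p_i * down / (h["G"] + h["V"] + p_i * after_intrusion)


def reach_proposed(p: Dict[str, float]) -> Dict[str, float]:
    """Weights for the proposed ITS, as in equation (5)."""
    return {"DMC": p["DM"], "UMC": p["UM"], "UNC": p["UN"], "DNC": p["DN"],
            "GD": p["DN"] * p["GD"], "FS": p["DN"] * p["FS"], "F": p["DN"] * p["F"]}


def reach_sitar(p: Dict[str, float]) -> Dict[str, float]:
    """Weights for SITAR, as in equation (6): no UMC state."""
    return {"DMC": p["DM"], "UNC": p["UN"], "DNC": p["DN"],
            "GD": p["DN"] * p["GD"], "FS": p["DN"] * p["FS"], "F": p["DN"] * p["F"]}


def reach_scit(p: Dict[str, float]) -> Dict[str, float]:
    """Weights for SCIT, as in equation (7): only UMC and F follow an intrusion."""
    return {"UMC": p["UM"], "F": p["F"]}


# Hypothetical values, for illustration only.
h = {"G": 10.0, "V": 5.0, "I": 2.0, "DMC": 1.0, "UMC": 1.0,
     "UNC": 4.0, "DNC": 1.0, "GD": 3.0, "FS": 3.0, "F": 6.0}
p = {"DM": 0.5, "UM": 0.2, "UN": 0.1, "DN": 0.2, "GD": 0.5, "FS": 0.3, "F": 0.2}

# Sensitivity of the proposed ITS to the probability of intrusion.
sweep = [availability(h, p_i, reach_proposed(p)) for p_i in (0.1, 0.3, 0.5, 0.7, 0.9)]
```

Each ITS needs its own set of transition probabilities out of state I summing to one, so the three `reach_*` helpers are meant to be called with separate parameter dictionaries rather than a shared one.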

Figure 8 and Figure 9 illustrate the availability of the proposed ITS,
SITAR and SCIT with regard to $p_I$ and $h_G$, respectively. The
steady-state availability is a decreasing function of $p_I$ for all three
ITSs (Figure 8). Compared to the other two ITSs, the availability of SCIT
falls sharply when the probability of intrusion increases. This is because
SCIT lacks detection capabilities and only uses periodic rejuvenation. In
Figure 8, the availability of the proposed ITS shows 0.6% and 36%
improvement compared to SITAR and SCIT, respectively (using the same
parameter values). Figure 9 shows the positive impact on the availability of
increasing the time that the system spends in the good state (i.e., the
availability increases as $h_G$ rises). For larger values of $h_G$, there is
only a slight difference in the availability of the three ITSs. In this
figure, the availability of the proposed ITS presents 0.3% and 9%
improvement compared to SITAR and SCIT, respectively. This is mostly due to
the use of the hybrid and fine-grained recovery approach in the proposed
ITS, which contributes to improving the system's availability.

Figure 9. Availability vs $h_G$.

5.3.1. SLA as another possible performance criterion

Service Level Agreement (SLA) can be considered as another performance
measure in critical infrastructures such as the smart grid. SLAs are
pre-defined agreements on some of the QoS parameters, including response
time, delay, data rate, and so on. Considering SLAs based on response time
is a reasonable assumption, since all SLAs are expected to improve the
observed response time. As stated in [56], having five nines availability
does not guarantee that the system's SLA is met. This means that even if a
system is available most of the time, it may not meet the SLA requirements.
This applies to smart grid control centers, in which satisfying the delay
requirements is of utmost importance. Therefore, similar to the definition
of the steady-state availability, the steady-state service level
agreeability has been defined in [56]. The authors divided the states of the
system according to a threshold response time into four sets: highly SLA
satisfying, SLA satisfying, SLA violating and highly SLA violating. Using
the steady-state service level agreeability concept, we can group the states
of our state diagram into these four clusters. It is evident that in states
G and V the system can satisfy the SLA completely, while in the intruded
state I and in states DMC and UMC, in which intrusion masking measures are
taken, SLA satisfaction may not reach the level of the highly SLA satisfying
class. In state DNC, in which the intrusion has been detected but cannot be
masked, and in state GD, in which the system only provides the essential
services, we expect degradation of service; thus, the SLA is violated.
Finally, state FS, which requires the system to stop functioning, and states
UNC and F, which result in the security failure of the system, fall into the
highly SLA violating group. To obtain the steady-state service level
agreeability, we can compute the steady-state probabilities of the states
using the SMP model and sum these probabilities over the states included in
each cluster.
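
The grouping and summation described above can be sketched in Python as follows. The assignment of states I, DMC and UMC to the "SLA satisfying" class is our reading of the text, and the steady-state probabilities are hypothetical placeholders rather than results of the model.

```python
from typing import Dict

# State-to-class assignment as read from the discussion above (an interpretation).
SLA_CLASS = {
    "G": "highly_satisfying", "V": "highly_satisfying",
    "I": "satisfying", "DMC": "satisfying", "UMC": "satisfying",
    "DNC": "violating", "GD": "violating",
    "FS": "highly_violating", "UNC": "highly_violating", "F": "highly_violating",
}


def sla_agreeability(pi: Dict[str, float]) -> Dict[str, float]:
    """Sum the steady-state probabilities of the states in each SLA class."""
    totals = {"highly_satisfying": 0.0, "satisfying": 0.0,
              "violating": 0.0, "highly_violating": 0.0}
    for state, prob in pi.items():
        totals[SLA_CLASS[state]] += prob
    return totals


# Hypothetical steady-state probabilities, for illustration only.
pi = {"G": 0.60, "V": 0.30, "I": 0.04, "DMC": 0.02, "UMC": 0.01,
      "DNC": 0.01, "GD": 0.01, "UNC": 0.004, "FS": 0.003, "F": 0.003}
classes = sla_agreeability(pi)
```

In practice the dictionary `pi` would be filled with the steady-state probabilities obtained from the SMP model.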

5.4. MTTSF Formulation and Analysis

Analogous to the Mean Time To Failure (MTTF) as a quantitative reliability
measure, MTTSF is a measure for quantifying the security of intrusion
tolerant systems [51]. MTTSF is defined as the mean elapsed time for the
system to reach one of the security-compromised states (also called
absorbing states), provided that the system begins in state G. Using an
approach similar to the availability analysis, we analyze the MTTSF with
regard to the parameters $p_I$ and $h_G$. We take advantage of an SMP with
absorbing and transient states. In the state transition diagram shown in
Figure 7, the states in the set $X_a$ = {UNC, GD, FS, F} are considered as
the absorbing states (i.e., the probability of moving out of these states is
zero). These states indicate the security-compromised states. The rest of
the states are called transient states and are denoted by $X_t$ = {G, V, I,
UMC, DMC, DNC}. The transition probability matrix $M$ exhibits, in an
organized form, the transition probabilities between the transient states
(i.e., $Q$) and the transition probabilities from transient states to
absorbing states (i.e., $C$).

As stated in [51], we can find the MTTSF by the following formula,

$$MTTSF = \sum_{i \in X_t} V_i h_i \tag{8}$$

where $V_i$ indicates the average number of times the transient state $i$ is
visited before the DTMC arrives at one of the absorbing states and $h_i$
indicates the mean state holding time in state $i$.

Let $q_i$ be the probability of starting in state $i$ (here, it is assumed
that the DTMC starts in state G) and $q_{ji}$ be the transition probability
from the transient state $j$ to the transient state $i$. The $V_i$s can then
be computed by solving the system of equations,

$$V_i = q_i + \sum_{j} V_j q_{ji}, \quad i, j \in X_t \tag{9}$$

Finally, we use (8) to calculate the MTTSF for the proposed ITS as,

$$MTTSF_P = \frac{h_G p_I^{-1} + h_V p_I^{-1} + h_I + h_{DMC} p_{DM} + h_{UMC} p_{UM} + h_{DNC} p_{DN}}{1 - p_{DM} - p_{UM}} \tag{10}$$

Using the same approach, we can find the formula for SITAR [51] and SCIT,

$$MTTSF_{SITAR} = \frac{h_G p_I^{-1} + h_V p_I^{-1} + h_I + h_{DMC} p_{DM} + h_{DNC} p_{DN}}{1 - p_{DM}} \tag{11}$$

$$MTTSF_{SCIT} = \frac{h_G p_I^{-1} + h_V p_I^{-1} + h_I + h_{UMC} p_{UM}}{1 - p_{UM}} \tag{12}$$
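
Equations (8)-(10) can be cross-checked numerically. The Python sketch below solves the visit-count equations (9) by fixed-point iteration on the transient part of the chain and compares the resulting MTTSF with the closed form for the proposed ITS. The transition structure is the one assumed from Section 5.1 (with the G to V transition taken to have probability one), and all parameter values are hypothetical.

```python
from typing import Dict

TRANSIENT = ["G", "V", "I", "DMC", "UMC", "DNC"]

# Hypothetical parameter values, for illustration only (p_UN = 0.1 is implied).
p = {"I": 0.4, "DM": 0.5, "UM": 0.2, "DN": 0.2}
h = {"G": 10.0, "V": 5.0, "I": 2.0, "DMC": 1.0, "UMC": 1.0, "DNC": 1.0}

# Q: transitions among transient states only (rows may sum to less than one).
Q: Dict[str, Dict[str, float]] = {
    "G": {"V": 1.0},
    "V": {"G": 1.0 - p["I"], "I": p["I"]},
    "I": {"DMC": p["DM"], "UMC": p["UM"], "DNC": p["DN"]},
    "DMC": {"G": 1.0},
    "UMC": {"G": 1.0},
    "DNC": {},          # DNC only leads to absorbing states
}


def expected_visits(Q, start="G", tol=1e-12, max_iter=100_000):
    """Solve V_i = q_i + sum_j V_j q_ji (equation (9)) by fixed-point iteration."""
    V = {s: 0.0 for s in TRANSIENT}
    for _ in range(max_iter):
        new = {i: (1.0 if i == start else 0.0)
                  + sum(V[j] * Q[j].get(i, 0.0) for j in TRANSIENT)
               for i in TRANSIENT}
        if max(abs(new[s] - V[s]) for s in TRANSIENT) < tol:
            return new
        V = new
    raise RuntimeError("fixed-point iteration did not converge")


def mttsf_numeric(Q, h) -> float:
    V = expected_visits(Q)
    return sum(V[s] * h[s] for s in TRANSIENT)            # equation (8)


def mttsf_closed_form(p, h) -> float:
    num = ((h["G"] + h["V"]) / p["I"] + h["I"]
           + h["DMC"] * p["DM"] + h["UMC"] * p["UM"] + h["DNC"] * p["DN"])
    return num / (1.0 - p["DM"] - p["UM"])                # equation (10)
```

The iteration converges because every transient state eventually leaks probability to an absorbing state; a direct linear solve would work equally well.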

As Figure 10 illustrates, MTTSF has an inverse relationship with the
probability of intrusion, i.e., it decreases as the probability of intrusion
rises. The proposed ITS architecture shows improved MTTSF with regard to
$p_I$ since it has more security features (e.g., proactive and reactive
recovery) and thus more system states (corresponding to tolerance measures)
when dealing with intrusions. Its MTTSF improves on that of the other two
ITSs (by 17% compared to SITAR and 2% compared to SCIT). Figure 11 depicts
the impact of increasing $h_G$ on MTTSF. As expected, MTTSF increases when
the system spends more time in state G. In this graph, with the values
assigned to the transition probabilities and state holding times, the
proactive rejuvenation in SCIT seems to have more effect on the MTTSF as
$h_G$ increases than the reactive rejuvenation in SITAR. The acquired
results show that the stability of our proposed ITS is better than that of
the others. The improvement in MTTSF is 16% and 0.8% compared to SITAR and
SCIT, respectively. The acquired results for MTTSF also indicate the
security enhancement of the proposed architecture (mostly thanks to the
combined rejuvenation approach) compared with the other two systems.

Figure 10. MTTSF vs $p_I$.

Figure 11. MTTSF vs $h_G$.

5.5. Discussion and analysis 

Self-healing capability is one of the distinctive features of the smart
grid. From the results in the previous sections, we can infer that the
masking measures (reflected in the states DMC and UMC in the provided state
transition diagram) and, in particular, the self-healing capabilities
(automatic rejuvenation) of the proposed ITS influence its performance to
a considerable extent. From Figure 7, we have

$$p_{DM} + p_{UM} + p_{UN} + p_{DN} = 1 \qquad (13)$$

Defining $p_M$ (i.e., the probability of masking an intrusion) as the sum of
$p_{DM}$ and $p_{UM}$, and $p_N$ (i.e., the probability of failing to mask an
intrusion) as the sum of $p_{DN}$ and $p_{UN}$, we obtain

$$p_M + p_N = 1 \qquad (14)$$

An ideal ITS would have $p_M$ equal to one. Therefore, the masking
capabilities of an ITS should be enhanced in order to obtain a more robust
and secure architecture. In the proposed ITS architecture, we made an
effort to increase $p_M$ compared with the other two ITSs (i.e., SITAR and
SCIT).
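
As a small illustration of Eqs. (13) and (14), the sketch below checks that
the four outcome probabilities of an intrusion sum to one and returns the
masking and non-masking probabilities. The function name `masking_split`
and the numerical values are hypothetical and are not taken from the
evaluation in this paper.

```python
import math


def masking_split(p_dm, p_um, p_un, p_dn):
    """Return (p_M, p_N) from the four intrusion outcome probabilities."""
    probs = (p_dm, p_um, p_un, p_dn)
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("each probability must lie in [0, 1]")
    # Eq. (13): the four outcomes are exhaustive.
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    p_m = p_dm + p_um  # intrusion is masked (detected or undetected)
    p_n = p_dn + p_un  # intrusion is not masked
    return p_m, p_n


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    p_m, p_n = masking_split(0.5, 0.2, 0.1, 0.2)
    print(p_m, p_n)
```

An ITS with stronger masking capabilities corresponds to shifting
probability mass from the last two arguments to the first two.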

6. Conclusion and future work 

This paper has examined the significance of intrusion tolerance as a
promising security approach for improving the security of smart grid
control centers. An ITS architecture was proposed for adoption in the
critical components of control centers, particularly SCADA/EMS. By
combining intrusion tolerance techniques such as replication, diversity,
and proactive and fine-grained reactive recovery, the proposed ITS
outperforms two well-known architectures, namely SITAR and SCIT. SITAR
possesses only reactive recovery, whereas SCIT relies on periodic
rejuvenation. In addition, the proposed ITS carries out acceptance testing
only on outgoing data, in contrast with SITAR, which applies acceptance
testing to both incoming and outgoing data. The response time of the
proposed ITS architecture is therefore expected to decrease, which in turn
enhances its security. The availability and MTTSF performance measures were
analyzed via a Discrete Time Semi Markov Model (which can be used as a
general model for assessing the security attributes of any ITS) and
compared with those of other ITSs. In future work, we will simulate the
proposed architecture and evaluate its performance with respect to other
metrics such as delay, which is a critical feature for smart grid control
centers.